Motivated by applications of rateless coding, decision feedback, and ARQ, we study
the problem of universal decoding for unknown channels, in the presence of an erasure
option. Specifically, we harness the competitive minimax methodology developed in
earlier studies, in order to derive a universal version of Forney's classical erasure/list
decoder, which in the erasure case, optimally trades off between the probability of
erasure and the probability of undetected error. The proposed universal erasure decoder
guarantees universal achievability of a certain fraction $\xi$ of the optimum error exponents
of these probabilities (in a sense to be made precise in the sequel). A single-letter
expression for $\xi$, which depends solely on the coding rate and the threshold, is provided.
The example of the binary symmetric channel is studied in full detail, and some
conclusions are drawn.
1 Introduction

When communicating across an unknown channel, classical channel coding at any fixed 
rate, however small, is inherently problematic since this fixed rate might be larger than the 
unknown capacity of the underlying channel. It makes sense then to try to adapt the coding 
rate to the channel conditions, which can be learned on-line at the transmitter whenever a 
feedback link, from the receiver to the transmitter, is available. 

One of the recent promising approaches to this end is rateless coding (see, e.g., [3], 
[4], [5], [11], [12], [13], [15], and references therein). According to this approach, there 
is a fixed number of messages $M$, each one being represented by a codeword of unlimited
length, in principle. After each transmitted symbol of the message selected, the decoder 
examines whether it can make a decision, namely, decode the message, with "reasonably 
good confidence," or alternatively, to request, via the feedback link, an additional symbol 
to be transmitted, before arriving at a decision. Upon receiving the new channel output, 
again, the receiver either makes a decision, or requests another symbol from the transmitter, 
and so on. 1 The coding rate, in such a scenario, is defined as log M divided by the expected 
number of symbols transmitted before the decoder finally commits to a decision. Clearly, 
at every time instant, the receiver of a rateless communication system operates just like 
an erasure decoder [7], which partitions the space of channel output vectors into (M + 1) 
regions, M for each one of the possible messages, and an additional region for "erasure," i.e., 
"no decision," which in the rateless regime, is used for requesting additional information 
from the transmitter. Keeping the erasure probability small is then motivated by the desire 
to keep the expected transmission time, for each message, small. Although these two criteria 
are not completely equivalent, they are nonetheless strongly related. 

This observation, as well as techniques such as ARQ and decision feedback, motivates
us to study the problem of universal decoding with an erasure option, for the class of
discrete memoryless channels (DMC's) indexed by an unknown parameter vector $\theta$ (e.g., the
set of channel transition probabilities). Specifically, we harness the competitive minimax
methodology proposed in [6], in order to derive a universal version of Forney's classical
erasure/list decoder. For a given DMC with parameter $\theta$, a given coding rate $R$, and a



1 Alternatively, the receiver can use the feedback link only to notify the transmitter when it has reached a
decision regarding the current message (and keep silent at all other times). In network situations, this would
not load the network much as it is done only once per message.
2 See also [16], [1], [10], [9] and references therein for later studies.



given threshold parameter $T$ (all to be formally defined later), Forney's erasure/list decoder
optimally trades off between the exponent $E_1(R,T,\theta)$ of the probability of the erasure event,
$\mathcal{E}_1$, and the exponent, $E_2(R,T,\theta) = E_1(R,T,\theta) + T$, of the probability of the undetected error
event, $\mathcal{E}_2$, in the random coding regime.

The universal erasure decoder, proposed in this paper, guarantees universal achievability
of an erasure exponent, $\hat{E}_1(R,T,\theta)$, which is at least as large as $\xi \cdot E_1(R,T,\theta)$ for all $\theta$, for
some constant $\xi \in (0,1]$ that is independent of $\theta$ (but does depend on $R$ and $T$), and at
the same time, an undetected error exponent $\hat{E}_2(R,T,\theta) \ge \xi \cdot E_1(R,T,\theta) + T$ for all $\theta$ (in
the random coding sense). At the very least, this guarantees that whenever the probabilities
of $\mathcal{E}_1$ and $\mathcal{E}_2$ decay exponentially for a known channel, so do they even when the channel
is unknown, using the proposed universal decoder. The question is, of course: what is the
largest value of $\xi$ for which the above statement holds? We answer this question by deriving
a single-letter expression for a lower bound to the largest value of $\xi$, denoted henceforth
by $\xi^*(R,T)$, that is guaranteed to be attainable by this decoder. It is conjectured that
$\xi^*(R,T)$ reflects the best fraction of $E_1(R,T,\theta)$ (and of $E_2(R,T,\theta)$ in the above sense)
that any decoder that is unaware of $\theta$ can uniformly achieve. Explicit results, including
numerical values of $\xi^*(R,T)$, are derived for the example of the binary symmetric channel
(BSC), parameterized by the crossover probability $\theta$, and some conclusions are drawn.

The outline of the paper is as follows. In Section 2, we establish the notation conventions 
and we briefly review some known results about erasure decoding. In Section 3, we formulate 
the problem of universal decoding with erasures. In Section 4, we present the proposed 
universal erasure decoder and prove its asymptotic optimality in the competitive minimax 
sense. In Section 5, we present the main results concerning the performance of the proposed
universal decoder. Section 6 is devoted to the example of the BSC. Finally, in Section 7, 
we summarize our conclusions. 

2 Notation and Preliminaries 

Throughout this paper, scalar random variables (RV's) will be denoted by capital letters, 
their sample values will be denoted by the respective lower case letters, and their alpha- 
bets will be denoted by the respective calligraphic letters. A similar convention will apply 
to random vectors of dimension $n$ and their sample values, which will be denoted with the
same symbols in the bold face font. The set of all $n$-vectors with components taking
values in a certain alphabet will be denoted as the same alphabet superscripted by $n$.
Thus, for example, a random vector $\boldsymbol{X} = (X_1,\ldots,X_n)$ may assume a specific vector value
$\boldsymbol{x} = (x_1,\ldots,x_n) \in \mathcal{X}^n$ as each component takes values in $\mathcal{X}$. Channels will be denoted
generically by the letter $P$, or $P_\theta$, when we wish to emphasize that the channel is indexed
or parametrized by a certain scalar or vector $\theta$, taking on values in some set $\Theta$. Information
theoretic quantities like entropies and conditional entropies will be denoted following the
usual conventions of the information theory literature, e.g., $H(X)$, $H(X|Y)$, and so on. The
cardinality of a finite set $\mathcal{A}$ will be denoted by $|\mathcal{A}|$.

Consider a discrete memoryless channel (DMC) with a finite input alphabet $\mathcal{X}$, finite
output alphabet $\mathcal{Y}$, and single-letter transition probabilities $\{P(y|x),\ x \in \mathcal{X},\ y \in \mathcal{Y}\}$.
As the channel is fed by an input vector $\boldsymbol{x} \in \mathcal{X}^n$, it generates an output vector $\boldsymbol{y} \in \mathcal{Y}^n$
according to the conditional probability distribution

$$P(\boldsymbol{y}|\boldsymbol{x}) = \prod_{i=1}^{n} P(y_i|x_i). \tag{1}$$

A rate-$R$ block code of length $n$ consists of $M = e^{nR}$ $n$-vectors $\boldsymbol{x}_m \in \mathcal{X}^n$, $m = 1,2,\ldots,M$,
which represent $M$ different messages. We will assume that all possible messages are a-priori
equiprobable, i.e., $P(m) = 1/M$ for all $m = 1,2,\ldots,M$.

A decoder with an erasure option is a partition of $\mathcal{Y}^n$ into $(M+1)$ regions, $\mathcal{R}_0, \mathcal{R}_1, \ldots, \mathcal{R}_M$.
Such a decoder works as follows: If $\boldsymbol{y}$ falls into $\mathcal{R}_m$, $m = 1,2,\ldots,M$, then a decision is
made in favor of message number $m$. If $\boldsymbol{y} \in \mathcal{R}_0$, no decision is made and an erasure is
declared. We will refer to $\mathcal{R}_0$ as the erasure event.

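Given a code $\mathcal{C} = \{\boldsymbol{x}_1,\ldots,\boldsymbol{x}_M\}$ and a decoder $\mathcal{R} = (\mathcal{R}_0,\mathcal{R}_1,\ldots,\mathcal{R}_M)$, let us now define
two additional undesired events. The event $\mathcal{E}_1$ is the event of not making the right decision.
This event is the disjoint union of the erasure event and the event $\mathcal{E}_2$, which is the undetected
error event, namely, the event of making the wrong decision. The probabilities of all three
events are defined as follows:

$$\Pr\{\mathcal{E}_1\} = \frac{1}{M}\sum_{m=1}^{M}\sum_{\boldsymbol{y}\in\mathcal{R}_m^c} P(\boldsymbol{y}|\boldsymbol{x}_m) \tag{2}$$

$$\Pr\{\mathcal{E}_2\} = \frac{1}{M}\sum_{m=1}^{M}\sum_{\boldsymbol{y}\in\mathcal{R}_m}\sum_{m'\ne m} P(\boldsymbol{y}|\boldsymbol{x}_{m'}) \tag{3}$$

$$\Pr\{\mathcal{R}_0\} = \Pr\{\mathcal{E}_1\} - \Pr\{\mathcal{E}_2\}. \tag{4}$$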

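Forney [7] assumes that the DMC is known to the decoder, and shows, using the Neyman-Pearson
methodology, that the best tradeoff between $\Pr\{\mathcal{E}_1\}$ and $\Pr\{\mathcal{E}_2\}$ (or, equivalently,
between $\Pr\{\mathcal{R}_0\}$ and $\Pr\{\mathcal{E}_2\}$) is attained by the decoder $\mathcal{R}^* = (\mathcal{R}_0^*,\mathcal{R}_1^*,\ldots,\mathcal{R}_M^*)$ defined
by

$$\mathcal{R}_m^* = \left\{\boldsymbol{y}:\ \frac{P(\boldsymbol{y}|\boldsymbol{x}_m)}{\sum_{m'\ne m}P(\boldsymbol{y}|\boldsymbol{x}_{m'})} \ge e^{nT}\right\},\quad m = 1,2,\ldots,M, \qquad \mathcal{R}_0^* = \bigcap_{m=1}^{M}(\mathcal{R}_m^*)^c, \tag{5}$$

where $(\mathcal{R}_m^*)^c$ is the complement of $\mathcal{R}_m^*$, and where $T \ge 0$ is a parameter, henceforth referred
to as the threshold, which controls the balance between the probabilities of $\mathcal{E}_1$ and $\mathcal{E}_2$.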
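As an illustration of the threshold rule above, the following is a minimal sketch for a toy setting that is not taken from the paper: a length-5 repetition code over a BSC with known crossover probability. The function names `bsc_likelihood` and `forney_decode` and all numerical values are assumptions chosen for the example.

```python
import math

def bsc_likelihood(y, x, theta):
    """P(y|x) for a memoryless BSC with crossover probability theta."""
    d = sum(yi != xi for yi, xi in zip(y, x))  # Hamming distance
    return theta ** d * (1.0 - theta) ** (len(x) - d)

def forney_decode(y, code, theta, T):
    """Return the decoded message index (1-based), or 0 for an erasure."""
    n = len(y)
    probs = [bsc_likelihood(y, x, theta) for x in code]
    total = sum(probs)
    for m, p in enumerate(probs, start=1):
        # Accept m when P(y|x_m) >= e^{nT} * sum of the other likelihoods.
        if p >= math.exp(n * T) * (total - p):
            return m
    return 0

code = [(0, 0, 0, 0, 0), (1, 1, 1, 1, 1)]  # hypothetical toy code, M = 2
theta, T = 0.1, 0.5
print(forney_decode((0, 0, 0, 0, 1), code, theta, T))  # prints 1 (confident decision)
print(forney_decode((0, 0, 1, 1, 1), code, theta, T))  # prints 0 (erasure)
```

With T = 0 the second output vector would be decoded as message 2, which shows how a larger threshold converts marginal decisions into erasures.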

Forney devotes the remaining part of his paper [7] to deriving lower bounds to the random
coding exponents (associated with $\mathcal{R}^*$), $E_1(R,T)$ and $E_2(R,T)$, of $\overline{\Pr}\{\mathcal{E}_1\}$ and $\overline{\Pr}\{\mathcal{E}_2\}$,
the average$^3$ probabilities of $\mathcal{E}_1$ and $\mathcal{E}_2$, respectively, and to investigating their properties.
Specifically, Forney shows, among other things, that for the ensemble of randomly chosen
codes, where each codeword is chosen independently under an i.i.d. distribution
$Q^n(\boldsymbol{x}) = \prod_{i=1}^{n} Q(x_i)$,

$$E_1(R,T) = \max_{0\le s\le\rho\le 1}\ \max_{Q}\ [E_0(s,\rho,Q) - \rho R - sT] \tag{6}$$

where

$$E_0(s,\rho,Q) = -\ln\left[\sum_{y\in\mathcal{Y}}\left(\sum_{x\in\mathcal{X}}Q(x)P^{1-s}(y|x)\right)\cdot\left(\sum_{x'\in\mathcal{X}}Q(x')P^{s/\rho}(y|x')\right)^{\rho}\right] \tag{7}$$

and

$$E_2(R,T) = E_1(R,T) + T. \tag{8}$$
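The exponent formulas (6)-(8) can be evaluated numerically. The sketch below is only an illustration under stated assumptions: it fixes the input distribution Q (the maximization over Q in (6) is not performed), replaces the maximization over (s, rho) by a coarse grid, and uses a BSC with crossover 0.1 as a hypothetical test channel. The names `E0` and `E1_grid` are chosen here, not taken from the paper.

```python
import math

def E0(s, rho, Q, P):
    """E_0(s, rho, Q) of eq. (7); P[x][y] are the transition probabilities."""
    total = 0.0
    for y in range(len(P[0])):
        a = sum(Q[x] * P[x][y] ** (1.0 - s) for x in range(len(Q)))
        b = sum(Q[x] * P[x][y] ** (s / rho) for x in range(len(Q)))
        total += a * b ** rho
    return -math.log(total)

def E1_grid(R, T, Q, P, step=0.01):
    """Grid approximation of eq. (6) for a fixed Q, over 0 < s <= rho <= 1."""
    best = 0.0  # value approached as s, rho -> 0
    k = round(1.0 / step)
    for i in range(1, k + 1):
        rho = i * step
        for j in range(1, i + 1):  # enforces s <= rho
            s = j * step
            best = max(best, E0(s, rho, Q, P) - rho * R - s * T)
    return best

theta = 0.1                                   # hypothetical BSC crossover
P = [[1 - theta, theta], [theta, 1 - theta]]
Q = [0.5, 0.5]                                # assumed uniform input distribution
R, T = 0.1, 0.05
e1 = E1_grid(R, T, Q, P, step=0.05)
print(e1, e1 + T)  # approximations of E_1(R,T) and E_2(R,T) = E_1(R,T) + T
```

A finer `step` tightens the approximation at quadratic cost in the number of grid points.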



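A simple observation that we will need, before passing to the case of an unknown channel,
is that the same decision rule $\mathcal{R}^*$ would be obtained if, rather than adopting the Neyman-Pearson
approach, one would consider a Lagrange function,

$$\Gamma(\mathcal{C},\mathcal{R}) = \Pr\{\mathcal{E}_2\} + e^{-nT}\Pr\{\mathcal{E}_1\}, \tag{9}$$

for a given code $\mathcal{C} = \{\boldsymbol{x}_1,\ldots,\boldsymbol{x}_M\}$ and a given threshold $T$, as the figure of merit, and
seek a decoder $\mathcal{R}$ that minimizes it. To see that this is equivalent, let us rewrite $\Gamma(\mathcal{C},\mathcal{R})$ as
follows:

$$\Gamma(\mathcal{C},\mathcal{R}) = \frac{1}{M}\sum_{m=1}^{M}\left[\sum_{\boldsymbol{y}\in\mathcal{R}_m}\sum_{m'\ne m}P(\boldsymbol{y}|\boldsymbol{x}_{m'}) + e^{-nT}\sum_{\boldsymbol{y}\in\mathcal{R}_m^c}P(\boldsymbol{y}|\boldsymbol{x}_m)\right], \tag{10}$$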



3 Here, "average" means w.r.t. the ensemble of randomly selected codes.



and it is now clear that for each $m$, the bracketed expression (which has the form of a weighted
error of a binary hypothesis testing problem) is minimized by $\mathcal{R}_m^*$ as defined above. Since
this decision rule is identical to Forney's, it is easy to see that the resulting exponential
decay of the ensemble average

$$E\{\Gamma(\mathcal{C},\mathcal{R}^*)\} = \overline{\Pr}\{\mathcal{E}_2\} + e^{-nT}\,\overline{\Pr}\{\mathcal{E}_1\}$$

is $E_2(R,T)$, as $\overline{\Pr}\{\mathcal{E}_1\}$ decays according to $e^{-nE_1(R,T)}$, $\overline{\Pr}\{\mathcal{E}_2\}$ decays according to $e^{-nE_2(R,T)}$,
and $E_2(R,T) = E_1(R,T) + T$, as mentioned earlier. This Lagrangian approach will be
more convenient to work with when we next move on to the case of an unknown DMC,
because it allows us to work with one figure of merit instead of a trade-off between two.
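The equivalence between the Lagrangian criterion and Forney's threshold rule can be checked by brute force on a very small example. The sketch below is an illustration under assumptions not found in the paper: a length-3 repetition code over a BSC with crossover 0.2 and T = 0.5, where all 3^8 possible decoders are enumerated. The helper names are hypothetical.

```python
import itertools
import math

def likelihood(y, x, theta):
    """P(y|x) for a memoryless BSC with crossover probability theta."""
    d = sum(a != b for a, b in zip(y, x))
    return theta ** d * (1.0 - theta) ** (len(x) - d)

def lagrangian(assign, outputs, code, theta, T):
    """Pr{E2} + e^{-nT} Pr{E1}; assign[i] in {0..M} is the region of outputs[i]."""
    n, M = len(code[0]), len(code)
    pr_e1 = pr_e2 = 0.0
    for y, k in zip(outputs, assign):
        for m, x in enumerate(code, start=1):
            p = likelihood(y, x, theta) / M
            if k != m:
                pr_e1 += p        # message m sent, y not decoded as m
                if k != 0:
                    pr_e2 += p    # ... and decoded as a wrong message
    return pr_e2 + math.exp(-n * T) * pr_e1

def forney_assign(outputs, code, theta, T):
    """Region index of every output vector under the threshold rule."""
    n = len(code[0])
    regions = []
    for y in outputs:
        probs = [likelihood(y, x, theta) for x in code]
        total = sum(probs)
        k = 0
        for m, p in enumerate(probs, start=1):
            if p >= math.exp(n * T) * (total - p):
                k = m
                break
        regions.append(k)
    return regions

code = [(0, 0, 0), (1, 1, 1)]                       # hypothetical toy code
outputs = list(itertools.product((0, 1), repeat=3))
theta, T = 0.2, 0.5
forney = lagrangian(forney_assign(outputs, code, theta, T), outputs, code, theta, T)
best = min(lagrangian(a, outputs, code, theta, T)
           for a in itertools.product(range(len(code) + 1), repeat=len(outputs)))
print(forney, best)  # the two values agree up to rounding
```

Exhaustive enumeration is feasible only for such tiny block lengths; it serves here as a sanity check rather than as a decoding method.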

3 Unknown Channel — Problem Description 

We now move on to the case of an unknown channel. While our techniques can be applied to
quite general classes of channels, here, for the sake of concreteness and conceptual simplicity,
and following [7], we confine attention to DMC's. Consider then a family of DMC's
$\{P_\theta(y|x),\ x \in \mathcal{X},\ y \in \mathcal{Y},\ \theta \in \Theta\}$, where $\theta$ is the parameter, or the index of the channel in
the class, taking values in some set $\Theta$. For example, $\theta$ may be a positive integer, denoting
the index of the channel within a finite or a countable index set. As another example, $\theta$
may simply represent the set of all $|\mathcal{X}| \cdot (|\mathcal{Y}| - 1)$ single-letter transition probabilities that
define the DMC, and if there are some symmetries (like in the BSC), these reduce the
dimensionality of $\theta$. The basic questions are now the following:

1. How to devise a good erasure decoder when the underlying channel is known to belong
to the class $\{P_\theta(y|x),\ x \in \mathcal{X},\ y \in \mathcal{Y},\ \theta \in \Theta\}$, but $\theta$ is unknown?

2. What are the resulting error exponents of $\mathcal{E}_1$ and $\mathcal{E}_2$ and how do they compare to
Forney's exponents for known $\theta$?

In the quest for universal schemes for decoding with an erasure option, two difficulties
are encountered in light of [7]. The first difficulty is that here we have two figures of merit,
the probabilities of $\mathcal{E}_1$ and $\mathcal{E}_2$. But this difficulty can be alleviated by adopting the
Lagrangian approach, described at the end of the previous section. The second difficulty
is somewhat deeper: Classical derivations of universal decoding rules for ordinary decoding
(without erasures) over the class of DMC's, like the maximum mutual information (MMI)
decoder [2] and its variants, were based on ideas that are deeply rooted in considerations of
joint typicality between the channel output $\boldsymbol{y}$ and each hypothesized codeword $\boldsymbol{x}_m$. These
considerations were easy to apply in ordinary decoding, where the score function (or, the
"metric") associated with the optimum maximum likelihood (ML) decoding, $\log P_\theta(\boldsymbol{y}|\boldsymbol{x}_m)$,
involves only one codeword at a time, and where this function depends on $\boldsymbol{x}_m$ and $\boldsymbol{y}$ only
via their joint empirical distribution, or, in other words, their joint type. Moreover, in the
case of decoding without erasures, given the true transmitted codeword $\boldsymbol{x}_m$ and the resulting
channel output $\boldsymbol{y}$, the scores associated with all other randomly chosen codewords are
independent of each other, a fact that facilitates the analysis to a great extent. This is very
different from the situation in erasure decoding, where Forney's optimum score function for
each codeword,

$$\frac{P_\theta(\boldsymbol{y}|\boldsymbol{x}_m)}{\sum_{m'\ne m}P_\theta(\boldsymbol{y}|\boldsymbol{x}_{m'})},$$

depends on all codewords at the same time. Consequently, in a random coding analysis, it
is rather complicated to apply joint typicality considerations, or to analyze the statistical
behavior of this expression, let alone the statistical dependency between the score functions
associated with the various codewords.

This difficulty is avoided if the competitive minimax methodology, proposed and developed
in [6], is applied. Specifically, let $\Gamma_\theta(\mathcal{C},\mathcal{R})$ denote the above defined Lagrangian,
where we now emphasize the dependence on the index of the channel, $\theta$. Let us also define
$\bar{\Gamma}_\theta^* = E\{\min_{\mathcal{R}}\Gamma_\theta(\mathcal{C},\mathcal{R})\}$, i.e., the ensemble average of the minimum of the above Lagrangian
(achieved by Forney's optimum decision rule) w.r.t. the channel $\{P_\theta(y|x)\}$ for a given $\theta$.
Note that the exponential order of $\bar{\Gamma}_\theta^*$ is $e^{-n[E_1(R,T,\theta)+T]} = e^{-nE_2(R,T,\theta)}$, where $E_1(R,T,\theta)$
and $E_2(R,T,\theta)$ are the new notations for $E_1(R,T)$ and $E_2(R,T)$, respectively, with the
dependence on the channel index $\theta$ made explicit. In principle, we would have been interested
in a decision rule $\mathcal{R}$ that achieves

$$\min_{\mathcal{R}}\ \max_{\theta\in\Theta}\ \frac{\Gamma_\theta(\mathcal{C},\mathcal{R})}{\bar{\Gamma}_\theta^*}, \tag{11}$$

or, equivalently,

$$\min_{\mathcal{R}}\ \max_{\theta\in\Theta}\ \frac{\Gamma_\theta(\mathcal{C},\mathcal{R})}{e^{-n[E_1(R,T,\theta)+T]}}, \tag{12}$$

but as is discussed in [6] (in the analogous context of ordinary decoding, without erasures),
such an ambitious minimax criterion of competing with the optimum performance may be
too optimistic. A better approach would be to compete with a similar expression of the
exponential behavior, but where the term $E_1(R,T,\theta)$ is multiplied by a constant
$\xi \in (0,1]$, which we would like to choose as large as possible. In other words, we are
interested in the competitive minimax criterion

$$K_n(\mathcal{C}) = \min_{\mathcal{R}}\ \max_{\theta\in\Theta}\ \frac{\Gamma_\theta(\mathcal{C},\mathcal{R})}{e^{-n[\xi E_1(R,T,\theta)+T]}}. \tag{13}$$

Similarly as in [6], we wish to find the largest value of $\xi$ such that the ensemble average
$\bar{K}_n = E\{K_n(\mathcal{C})\}$ would not grow exponentially fast, i.e.,

$$\limsup_{n\to\infty}\ \frac{1}{n}\log\bar{K}_n \le 0. \tag{14}$$

The rationale behind this is the following: If $\bar{K}_n$ is sub-exponential in $n$, for some $\xi$, then
this guarantees that there exists a universal erasure decoder, say $\hat{\mathcal{R}}$, such that for every
$\theta \in \Theta$, the exponential order of $E\{\Gamma_\theta(\mathcal{C},\hat{\mathcal{R}})\}$ is no worse than $e^{-n[\xi E_1(R,T,\theta)+T]}$. This, in
turn, implies that both terms of $\Gamma_\theta(\mathcal{C},\hat{\mathcal{R}})$ decay at least as fast as $e^{-n[\xi E_1(R,T,\theta)+T]}$, which means
that for the decoder $\hat{\mathcal{R}}$, the exponent of $\Pr\{\mathcal{E}_1\}$ is at least $\xi \cdot E_1(R,T,\theta)$ and the exponent
of $\Pr\{\mathcal{E}_2\}$ is at least $\xi \cdot E_1(R,T,\theta) + T$, both for every $\theta \in \Theta$. Thus, the difference between
the two (guaranteed) exponents remains $T$ as before (as the weight of the term $\Pr\{\mathcal{E}_1\}$ in
$\Gamma(\mathcal{C},\mathcal{R})$ is $e^{-nT}$), but the other term, $E_1(R,T,\theta)$, is now scaled by a factor of $\xi$.

The remaining parts of this paper focus on deriving a universal decoding rule that
asymptotically achieves $K_n(\mathcal{C})$ for a given $\xi$, and on analyzing its performance, i.e., finding
the maximum value of $\xi$ such that $\bar{K}_n$ still grows sub-exponentially.

4 Derivation of a Universal Erasure Decoder 

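For a given $\xi \in (0,1]$, let us define

$$f(\boldsymbol{x}_m,\boldsymbol{y}) \triangleq \max_{\theta\in\Theta}\left\{e^{n[\xi E_1(R,T,\theta)+T]}\,P_\theta(\boldsymbol{y}|\boldsymbol{x}_m)\right\} \tag{15}$$

and consider the decoder

$$\hat{\mathcal{R}}_m = \left\{\boldsymbol{y}:\ \frac{f(\boldsymbol{x}_m,\boldsymbol{y})}{\sum_{m'\ne m}f(\boldsymbol{x}_{m'},\boldsymbol{y})} \ge e^{nT}\right\},\quad m = 1,2,\ldots,M, \qquad \hat{\mathcal{R}}_0 = \bigcap_{m=1}^{M}\hat{\mathcal{R}}_m^c. \tag{16}$$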
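To make the rule (15)-(16) concrete, here is a rough sketch for the BSC, computed in the log domain. Several ingredients are assumptions made only for illustration: the parameter set is replaced by a finite grid of crossover values in (0, 1/2), E_1 is approximated by a grid search over the BSC form of the exponent with uniform random coding, the value of `xi` is an arbitrary placeholder rather than the optimized fraction, and the code is a toy repetition code.

```python
import math

def E1_bsc(R, T, theta, step=0.02):
    """Grid approximation of E_1(R,T,theta) for the BSC, uniform random coding."""
    best, k = 0.0, round(1.0 / step)
    for i in range(1, k + 1):
        rho = i * step
        for j in range(1, i + 1):
            s = j * step
            a = theta ** (1 - s) + (1 - theta) ** (1 - s)
            b = theta ** (s / rho) + (1 - theta) ** (s / rho)
            best = max(best, rho * math.log(2) - math.log(a)
                       - rho * math.log(b) - rho * R - s * T)
    return best

def log_f(d, n, xi, T, table):
    """ln f(x,y) of eq. (15); d is the Hamming distance between x and y."""
    return max(n * (xi * e1 + T) + d * math.log(th) + (n - d) * math.log(1 - th)
               for th, e1 in table)

def universal_decode(y, code, xi, T, table):
    """Return the decoded message index (1-based), or 0 for an erasure."""
    n = len(y)
    logs = [log_f(sum(a != b for a, b in zip(y, x)), n, xi, T, table) for x in code]
    for m, lf in enumerate(logs, start=1):
        others = [v for i, v in enumerate(logs, start=1) if i != m]
        top = max(others)
        log_sum = top + math.log(sum(math.exp(v - top) for v in others))
        if lf >= n * T + log_sum:  # f(x_m,y) >= e^{nT} * sum of the other f's
            return m
    return 0

n = 7
code = [(0,) * n, (1,) * n]               # hypothetical toy code, M = 2
R, T, xi = math.log(2) / n, 0.1, 0.8      # xi is a placeholder value
thetas = [0.01 * i for i in range(1, 50)]  # assumed finite grid for Theta
table = [(th, E1_bsc(R, T, th)) for th in thetas]
print(universal_decode((0,) * n, code, xi, T, table))
print(universal_decode((0, 0, 0, 1, 1, 1, 1), code, xi, T, table))
```

The decoder never uses the true crossover probability; the maximization over the grid plays the role of the maximization over the unknown channel in (15).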



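Denoting

$$K_n(\mathcal{C},\mathcal{R}) = \max_{\theta\in\Theta}\ \frac{\Gamma_\theta(\mathcal{C},\mathcal{R})}{e^{-n[\xi E_1(R,T,\theta)+T]}}, \tag{17}$$

for a given encoder $\mathcal{C} = \{\boldsymbol{x}_1,\ldots,\boldsymbol{x}_M\}$ and decoder $\mathcal{R}$, our first main result establishes
the asymptotic optimality of $\hat{\mathcal{R}}$ in the competitive minimax sense, namely, that $K_n(\mathcal{C},\hat{\mathcal{R}})$
is within a sub-exponential factor of $K_n(\mathcal{C}) = \min_{\mathcal{R}} K_n(\mathcal{C},\mathcal{R})$, and therefore,
$E\{K_n(\mathcal{C},\hat{\mathcal{R}})\}$ is within the same sub-exponential factor of $\bar{K}_n = E\{K_n(\mathcal{C})\}$.

Theorem 1 For every code $\mathcal{C}$,

$$K_n(\mathcal{C},\hat{\mathcal{R}}) \le (n+1)^{|\mathcal{X}|\cdot|\mathcal{Y}|-1}\,K_n(\mathcal{C}). \tag{18}$$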

Comment: Note that the summation $\sum_{m'\ne m} f(\boldsymbol{x}_{m'},\boldsymbol{y})$ might pose some numerical challenges
since it is a summation of many terms spanning a potentially large range of orders of
magnitude. An asymptotically equivalent version of $\hat{\mathcal{R}}$, that avoids such summations
altogether, is the following. Let $M(a)$ be the number of $\{\boldsymbol{x}_{m'}\}$ for which $f(\boldsymbol{x}_{m'},\boldsymbol{y}) = a$.
Since $f(\boldsymbol{x}_{m'},\boldsymbol{y})$ depends on $(\boldsymbol{x}_{m'},\boldsymbol{y})$ only via their joint empirical distribution (see the
proof of Theorem 1, next), the number of possible values of $a$ is at most polynomial
in $n$. Then, $\sum_{m'\ne m} f(\boldsymbol{x}_{m'},\boldsymbol{y})$ can be replaced by $\max_a [a \cdot M(a)]$, without affecting the
asymptotic optimality.
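The replacement of the sum by the largest "value times multiplicity" term can be illustrated numerically in the log domain. In the sketch below, the scores are hypothetical: they are assumed to depend only on a Hamming distance d, with the number of codewords at distance d given by a binomial coefficient. The point is only that the two quantities differ by at most the logarithm of the number of distinct score values.

```python
import math
from collections import Counter

def log_sum_exact(log_scores):
    """ln of the sum of exp(log_scores), computed stably."""
    top = max(log_scores)
    return top + math.log(sum(math.exp(v - top) for v in log_scores))

def log_sum_approx(log_scores):
    """ln max_a [a * M(a)], where M(a) counts the scores equal to a."""
    counts = Counter(log_scores)
    return max(v + math.log(c) for v, c in counts.items())

n = 12
# Hypothetical scores: ln f = -1.5 * d, shared by comb(n, d) codewords.
log_scores = [-1.5 * d for d in range(n + 1) for _ in range(math.comb(n, d))]
exact = log_sum_exact(log_scores)
approx = log_sum_approx(log_scores)
print(exact, approx, exact - approx)  # the gap is at most ln(n + 1)
```

Since the gap grows only logarithmically in the number of distinct values, it does not affect exponents.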

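Proof. The proof technique is similar to that of [6]. As $\boldsymbol{x}$ and $\boldsymbol{y}$ exhaust their spaces,
$\mathcal{X}^n$ and $\mathcal{Y}^n$, let $\Theta_n$ denote the set of values of $\theta$ that achieve $\{f(\boldsymbol{x},\boldsymbol{y}),\ \boldsymbol{x} \in \mathcal{X}^n,\ \boldsymbol{y} \in \mathcal{Y}^n\}$.
Observe that for every $\theta$, the expression $e^{n[\xi E_1(R,T,\theta)+T]}P_\theta(\boldsymbol{y}|\boldsymbol{x})$ depends on $(\boldsymbol{x},\boldsymbol{y})$ only
via their joint empirical distribution (or, the joint type). Consequently, the value of $\theta$ that
achieves $f(\boldsymbol{x},\boldsymbol{y})$ also depends on $(\boldsymbol{x},\boldsymbol{y})$ only via their joint empirical distribution. Since the
number of joint empirical distributions of $(\boldsymbol{x},\boldsymbol{y})$ never exceeds $(n+1)^{|\mathcal{X}|\cdot|\mathcal{Y}|-1}$ (see [2]), then
obviously

$$|\Theta_n| \le (n+1)^{|\mathcal{X}|\cdot|\mathcal{Y}|-1} \tag{19}$$

as well. Now, for every encoder $\mathcal{C}$ and decoder $\mathcal{R}$,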

r e (c,n) 



K n (C,TZ) = max 



ee0 e -T»K£i(fl,T,*)+r| 

-. M 

max — - > 
eee M ^ 

m=l 



y- y- Pe(y\x m >) _ nT ^ P e (y\x m ) 

2^ 2^ p ~n[^E l (R,Tfi)+T} +e 2^, p -n[^E l (R,Tfi)+T} 



. M 

< — y 

ra=l 

-. M 

= —y 

m=l 



y- y- Pe(y\x m >) _ nT y- Pg(y\x m ) 

2^ 2^ "f^ n^E l (R,T,e)+T] +e 2^ ^ e - n \iE 1 {R,T,e)+T\ 

y y f(x m ,,y) + e- nT Yl f( x m,y) 



K n (c,n) 

-. M 

— Y 



< 
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M 



m=l 
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m=l 



E E 



max 



Pe{y\x r , 



eee n e -n[££i(ii,T,0)+T] 



+ e 



-nT 
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yen* 



max 



Pe(y\x r 



0G0„ e -n[£,E 1 {R,Tfi)+T\ 



E E E 

y£Tl m m'^m y?ee„ 

ye^ \eee„ 



•Ped/lscr 



-n[££i(fl,T,0)+T] 
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y — y 

SG6 n m=\ 



e -n[(,E 1 {R,Tfi)+T\ 

Pe{y\x r . 



-nT 
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yen* 
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Pe{y\x m ) 

e -n[(,E 1 (R,Tfi)+T] 



e -n[£,Ei{R,T,6)+T\ 
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Pe(y\x r 
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e -n[££i(fl,T,0)+T] 
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e -n[££i(fl,T,0)+T] 



+ 



-nT 



E 



-P<?(y|a5m) 



e -n[£Ei(fl,r,6>)+T] 






(20) 



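Thus, we have defined $\tilde{K}_n(\mathcal{C},\mathcal{R})$ and sandwiched it between $K_n(\mathcal{C},\mathcal{R})$ and
$(n+1)^{|\mathcal{X}|\cdot|\mathcal{Y}|-1} \cdot K_n(\mathcal{C},\mathcal{R})$ uniformly for every $\mathcal{C}$ and $\mathcal{R}$. Now, obviously, $\hat{\mathcal{R}}$ minimizes
$\tilde{K}_n(\mathcal{C},\mathcal{R})$, and so, for every $\mathcal{R}$,

$$K_n(\mathcal{C},\hat{\mathcal{R}}) \le \tilde{K}_n(\mathcal{C},\hat{\mathcal{R}}) \le \tilde{K}_n(\mathcal{C},\mathcal{R}) \le (n+1)^{|\mathcal{X}|\cdot|\mathcal{Y}|-1}\cdot K_n(\mathcal{C},\mathcal{R}), \tag{21}$$

where the first and the third inequalities were just proved in the chain of inequalities (20),
and the second inequality follows from the optimality of $\hat{\mathcal{R}}$ w.r.t. $\tilde{K}_n(\mathcal{C},\mathcal{R})$. Since we have
shown that

$$K_n(\mathcal{C},\hat{\mathcal{R}}) \le (n+1)^{|\mathcal{X}|\cdot|\mathcal{Y}|-1}\cdot K_n(\mathcal{C},\mathcal{R})$$

for every $\mathcal{R}$, we can now minimize the r.h.s. w.r.t. $\mathcal{R}$ and the assertion of Theorem 1 is
obtained. This completes the proof of Theorem 1.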
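The polynomial factor in Theorem 1 comes from counting joint types. As a small sanity check, the number of joint empirical distributions of n letter pairs equals the number of ways to split n into |X||Y| nonnegative counts, which can be compared with the bound used in (19). The function name below is chosen for this sketch only.

```python
import math

def num_joint_types(n, kx, ky):
    """Number of joint empirical distributions of n pairs over a kx-by-ky alphabet."""
    k = kx * ky
    # Compositions of n into k nonnegative integer counts ("stars and bars").
    return math.comb(n + k - 1, k - 1)

for n in (1, 5, 20, 100):
    exact = num_joint_types(n, 2, 2)
    bound = (n + 1) ** (2 * 2 - 1)
    print(n, exact, bound)
```

For binary input and output alphabets the bound is cubic in n, so it vanishes on the exponential scale.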
5 Performance

In this section, we present an upper bound to $\bar{K}_n$ from which we derive a lower bound to
$\xi^*$, the largest value of $\xi$ for which $\bar{K}_n$ is sub-exponential in $n$.

Given a distribution $P_y$ on $\mathcal{Y}$, a positive real $\lambda$, and a value of $\theta$, let

$$F(P_y,\lambda,\theta) = \ln|\mathcal{X}| - \max_{P_{x|y}}\,[H(X|Y) + \lambda E\ln P_\theta(Y|X)], \tag{22}$$

where $E\{\cdot\}$ is the expectation and $H(X|Y)$ is the conditional entropy w.r.t. a generic joint
distribution $P_{xy}(x,y) = P_y(y)P_{x|y}(x|y)$ of the RV's $(X,Y)$. Next, for a pair $(\theta,\tilde{\theta}) \in \Theta^2$,
and for two real numbers $s$ and $\rho$, $0 \le s \le \rho \le 1$, define:

$$E(\theta,\tilde{\theta},\rho,s) = \min_{P_y}\,[F(P_y,1-s,\theta) + \rho F(P_y,s/\rho,\tilde{\theta}) - H(Y)], \tag{23}$$

where $H(Y)$ is the entropy of $Y$ induced by $P_y$. Finally, let

$$\xi^*(R,T) = \min_{\theta,\tilde{\theta}}\ \max_{0\le s\le\rho\le 1}\ \frac{E(\theta,\tilde{\theta},\rho,s) - \rho R - sT}{(1-s)E_1(R,T,\theta) + sE_1(R,T,\tilde{\theta})}, \tag{24}$$

with the convention that if the denominator vanishes, then $\xi^*(R,T) = 1$. Our main result
in this section is the following:

Theorem 2 Consider the ensemble of codes where each codeword is drawn independently,
under the uniform distribution $Q(\boldsymbol{x}) = 1/|\mathcal{X}|^n$ for all $\boldsymbol{x}$. Then,

1. For every $\xi \le \xi^*(R,T)$,

$$\limsup_{n\to\infty}\ \frac{1}{n}\log\bar{K}_n \le 0.$$

2. For $\xi = \xi^*(R,T)$, the average probability of $\mathcal{E}_1$ and the average probability of $\mathcal{E}_2$,
associated with the decoder $\hat{\mathcal{R}}$, decay with exponential rates at least as large as
$\xi^*(R,T) \cdot E_1(R,T,\theta)$ and $\xi^*(R,T) \cdot E_1(R,T,\theta) + T$, respectively, for all $\theta \in \Theta$.

The proof of Theorem 2 appears in the appendix. 

We now pause to discuss Theorem 2 and some of its aspects. 

Theorem 2 suggests a conceptually simple strategy: Given $R$ and $T$, first compute
$\xi^*(R,T)$ using eq. (24). This may require some non-trivial optimization procedures, but it
has to be done only once, and since this is a single-letter expression, it can be carried out at
least numerically, if closed-form analytic expressions are not available (see
the example of the BSC below). Once $\xi^*(R,T)$ has been computed, apply the decoding rule
$\hat{\mathcal{R}}$ with $\xi = \xi^*(R,T)$, and the theorem guarantees that the resulting random coding error
exponents of $\mathcal{E}_1$ and $\mathcal{E}_2$ are as specified in the second item of that theorem.

The theorem is interesting, of course, only when $\xi^*(R,T) > 0$, which is the case in
many situations, at least as long as $R$ and $T$ are not too large. When $\xi^*(R,T) > 0$, the
proposed universal decoder with $\xi = \xi^*(R,T)$ has the important property that whenever
Forney's optimum decoder yields an exponential decay of $\Pr\{\mathcal{E}_1\}$ ($E_1(R,T,\theta) > 0$), then so
does the proposed decoder, $\hat{\mathcal{R}}$. It should be pointed out that
the exponential rates $\xi^*(R,T) \cdot E_1(R,T,\theta)$ and $\xi^*(R,T) \cdot E_1(R,T,\theta) + T$, guaranteed by
Theorem 2, are only lower bounds to the real exponential rates, and that the true exponential
rate, at some points in $\Theta$, might be larger.

The derivation of $\xi^*(R,T)$ is carried out (see the appendix) using the same bounding
techniques as in Gallager's classical work and as in [7], which are apparently tight in the
random coding sense. We therefore conjecture that $\xi^*(R,T)$ is not merely a lower bound
to the best fraction of $E_1(R,T,\theta)$ that is universally achievable, but that it actually
cannot be improved upon. If this conjecture is true, it means that unlike the case of ordinary
universal decoding (without erasures), where the optimum random coding error exponent
is universally achievable over the DMC, i.e., $\xi^* = 1$ [2], [17], here, when erasures are brought
into the picture, this is no longer the case, as $\xi^*(R,T)$ is normally strictly less than unity,
as we demonstrate later in the example of the BSC. We will also demonstrate, in this
example, that for the case $T = 0$, which is asymptotically equivalent to the case without
erasures in the sense that $E_1(R,0,\theta) = E_2(R,0,\theta)$ coincide with Gallager's random coding
exponent [8] (although erasures are still possible), we get $\xi^*(R,0) = 1$, in agreement with
the aforementioned full universality result for ordinary universal decoding.

In Theorem 2, we assumed that the random coding distribution $Q$ is uniform over
$\mathcal{X}^n$. This assumption is reasonable since, in the absence of any prior knowledge about the
channel, no vectors in $\mathcal{X}^n$ appear to have any preference over other vectors (see also [14] for
another justification). It is also relatively simple to analyze the random coding performance
in this case. It is straightforward, however, to modify the results to any random coding
distribution $Q(\boldsymbol{x})$ which depends on $\boldsymbol{x}$ only via its empirical distribution (for example, any
other i.i.d. distribution, or a uniform distribution within one type class). This can easily be
done using the method of types [2] (see the appendix).

Our last comment concerns the choice of the threshold $T$. Thus far, we assumed that
$T$ is a constant, independent of $\theta$. However, in some situations, it makes sense to let $T$
depend on the quality of the channel, and hence on the parameter $\theta$. Intuitively, for fixed
$T$, if the signal-to-noise ratio (SNR) becomes very high, the erasure option will be used so
rarely that it will effectively be non-existent. This means that we are actually no longer
"enjoying" the benefits of the erasure option, and hence not the gain in the undetected
error exponent that is associated with it. An alternative approach is to let $T = T_\theta$ depend
on $\theta$ in a certain way. In this case, $K_n(\mathcal{C})$ would be redefined as follows:

$$K_n(\mathcal{C}) = \min_{\mathcal{R}}\ \max_{\theta\in\Theta}\ \frac{\Gamma_\theta(\mathcal{C},\mathcal{R})}{e^{-n[\xi E_1(R,T_\theta,\theta)+T_\theta]}}. \tag{25}$$

The corresponding generalized version of the competitive minimax decision rule $\hat{\mathcal{R}}$ would
now be:

$$\hat{\mathcal{R}}_m = \left\{\boldsymbol{y}:\ g(\boldsymbol{x}_m,\boldsymbol{y}) \ge \sum_{m'\ne m}h(\boldsymbol{x}_{m'},\boldsymbol{y})\right\},\quad m = 1,\ldots,M, \qquad \hat{\mathcal{R}}_0 = \bigcap_{m=1}^{M}\hat{\mathcal{R}}_m^c, \tag{26}$$

where

$$g(\boldsymbol{x}_m,\boldsymbol{y}) = \max_{\theta\in\Theta}\left[P_\theta(\boldsymbol{y}|\boldsymbol{x}_m)\cdot e^{n\xi E_1(R,T_\theta,\theta)}\right] \tag{27}$$

and

$$h(\boldsymbol{x}_m,\boldsymbol{y}) = \max_{\theta\in\Theta}\left[P_\theta(\boldsymbol{y}|\boldsymbol{x}_m)\cdot e^{n[\xi E_1(R,T_\theta,\theta)+T_\theta]}\right]. \tag{28}$$

By extending the performance analysis carried out in the appendix, the resulting expression
of $\xi^*$ now becomes

$$\xi^*(R) = \min_{\theta,\tilde{\theta}}\ \max_{0\le s\le\rho\le 1}\ \frac{E(\theta,\tilde{\theta},\rho,s) - \rho R - sT_{\tilde{\theta}}}{(1-s)E_1(R,T_\theta,\theta) + sE_1(R,T_{\tilde{\theta}},\tilde{\theta})}. \tag{29}$$

The main question that naturally arises, in this case, is: which function $T_\theta$ would be
reasonable to choose? A plausible guideline could be based on the typical behavior of

$$\tau_\theta = \lim_{n\to\infty}\frac{1}{n}E\ln\frac{P_\theta(\boldsymbol{Y}|\boldsymbol{x}_m)}{\sum_{m'\ne m}P_\theta(\boldsymbol{Y}|\boldsymbol{x}_{m'})},$$

which can be assessed, using standard bounding techniques, under the hypothesis that $\boldsymbol{x}_m$
is the correct message. For example, $T_\theta$ may be given by $\alpha\tau_\theta$ with some constant $\alpha \in [0,1]$,
or $\tau_\theta - \beta$ for some $\beta > 0$. This will make the probability of erasure (exponentially) small,
but not too small, so that there would be some gain in the undetected error exponent for
every $\theta$.
6 Example: the Binary Symmetric Channel

Consider the BSC, where $\mathcal{X} = \mathcal{Y} = \{0,1\}$, and where $\theta$ designates the crossover probability.
We would like to examine, more closely, the expression of $\xi^*(R,T)$ and its behavior in this
case. Let $h_2(u)$ denote the binary entropy function, $-u\ln u - (1-u)\ln(1-u)$, $u \in [0,1]$.
Denoting the modulo 2 sum of $X$ and $Y$ by $X \oplus Y$, we have:



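$$
\begin{aligned}
F(P_y,\lambda,\theta) &= \ln 2 - \max_{P_{x|y}}\,[H(X|Y) + \lambda E\ln P_\theta(Y|X)] \\
&= \ln 2 - \max_{P_{x|y}}\left\{H(X|Y) + \lambda E\ln\left[(1-\theta)\left(\frac{\theta}{1-\theta}\right)^{X\oplus Y}\right]\right\} \\
&= \ln 2 - \lambda\ln(1-\theta) - \max_{P_{x|y}}\left[H(X|Y) + \left(\lambda\ln\frac{\theta}{1-\theta}\right)E(X\oplus Y)\right] \\
&= \ln 2 - \lambda\ln(1-\theta) - \max_{P_{x|y}}\left[H(X\oplus Y|Y) + \left(\lambda\ln\frac{\theta}{1-\theta}\right)E(X\oplus Y)\right] \\
&\ge \ln 2 - \lambda\ln(1-\theta) - \max_{P_{x|y}}\left[H(X\oplus Y) + \left(\lambda\ln\frac{\theta}{1-\theta}\right)E(X\oplus Y)\right] \\
&= \ln 2 - \lambda\ln(1-\theta) - \max_{u\in[0,1]}\left[h_2(u) + \left(\lambda\ln\frac{\theta}{1-\theta}\right)u\right] \\
&= \ln 2 - \lambda\ln(1-\theta) - \ln\left[1 + \left(\frac{\theta}{1-\theta}\right)^{\lambda}\right] \\
&= \ln 2 - \ln[\theta^{\lambda} + (1-\theta)^{\lambda}],
\end{aligned}
\tag{30}
$$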



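where the inequality is, in fact, an equality achieved by a backward channel $P_{x|y}$ where $X \oplus Y$ is
independent of $Y$. Since $F(P_y,\lambda,\theta)$ is independent of $P_y$, this easily yields

$$E(\theta,\tilde{\theta},\rho,s) = \rho\ln 2 - \ln[\theta^{1-s} + (1-\theta)^{1-s}] - \rho\ln[\tilde{\theta}^{s/\rho} + (1-\tilde{\theta})^{s/\rho}] \tag{31}$$

and so,

$$\xi^*(R,T) = \min_{\theta,\tilde{\theta}}\ \max_{0\le s\le\rho\le 1}\ \frac{\rho\ln 2 - \ln[\theta^{1-s} + (1-\theta)^{1-s}] - \rho\ln[\tilde{\theta}^{s/\rho} + (1-\tilde{\theta})^{s/\rho}] - \rho R - sT}{(1-s)E_1(R,T,\theta) + sE_1(R,T,\tilde{\theta})} \tag{32}$$

with

$$E_1(R,T,\theta) = \max_{0\le s\le\rho\le 1}\left\{\rho\ln 2 - \ln[\theta^{1-s} + (1-\theta)^{1-s}] - \rho\ln[\theta^{s/\rho} + (1-\theta)^{s/\rho}] - \rho R - sT\right\}. \tag{33}$$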

0<s<p<l 

This expression, although it still involves non-trivial optimizations, is much more explicit
than the general one. We next offer a few observations regarding the function $\xi^*(R,T)$ for
the example of the BSC.
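The closed form obtained for F in the BSC case rests on the identity max over u of [h_2(u) + a u] = ln(1 + e^a). The following sketch checks this numerically by comparing a grid maximization over u with the closed-form expression ln 2 - ln[theta^lambda + (1-theta)^lambda]; the function names and the test points are chosen only for this illustration.

```python
import math

def h2(u):
    """Binary entropy in nats."""
    if u in (0.0, 1.0):
        return 0.0
    return -u * math.log(u) - (1 - u) * math.log(1 - u)

def F_numeric(lam, theta, grid=10000):
    """ln2 - lam*ln(1-theta) - max_u [h2(u) + lam*ln(theta/(1-theta))*u], on a grid."""
    a = lam * math.log(theta / (1 - theta))
    best = max(h2(i / grid) + a * (i / grid) for i in range(grid + 1))
    return math.log(2) - lam * math.log(1 - theta) - best

def F_closed(lam, theta):
    """Closed form: ln2 - ln[theta^lam + (1-theta)^lam]."""
    return math.log(2) - math.log(theta ** lam + (1 - theta) ** lam)

for lam, theta in ((0.5, 0.1), (1.0, 0.3), (0.25, 0.45)):
    print(lam, theta, F_numeric(lam, theta), F_closed(lam, theta))
```

For lambda = 1 the closed form reduces to ln 2, since theta + (1 - theta) = 1.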
First, observe that if $\Theta$ is a singleton, i.e., we are back to the case of a known channel,
then $\theta = \tilde{\theta}$, and the numerator, after maximization over $\rho$ and $s$, becomes $E_1(R,T,\theta)$, and
so does the denominator, thus $\xi^*(R,T) = 1$, as expected. Secondly, we argue that there
exists a region of $R$ and $T$ (both not too large) such that $\xi^*(R,T) > 0$. To see this, note
that there are four possibilities regarding the minimizers $\theta$ and $\tilde{\theta}$ in the above minimax
problem:

1. $\theta = \tilde{\theta} = 1/2$: In this case, the denominator vanishes too and so, $\xi^*(R,T) = 1$.

2. Both $\theta \ne 1/2$ and $\tilde{\theta} \ne 1/2$: Let $\theta_0$ be the closer to 1/2 between $\theta$ and $\tilde{\theta}$. Then, the
numerator is obviously lower bounded by

$$\rho\ln 2 - \ln[\theta_0^{1-s} + (1-\theta_0)^{1-s}] - \rho\ln[\theta_0^{s/\rho} + (1-\theta_0)^{s/\rho}] - \rho R - sT,$$

which upon maximizing over $\rho$ and $s$ gives $E_1(R,T,\theta_0)$, which is positive as long as $R$
and $T$ are not too large.

3. $\theta = 1/2$ and $\tilde{\theta} \ne 1/2$: In this case, the numerator is given by

$$\rho\ln 2 - \rho\ln[\tilde{\theta}^{s/\rho} + (1-\tilde{\theta})^{s/\rho}] - \rho R - s(T + \ln 2).$$

Choosing $\rho = 1$ and $s = 1/2$, we get

$$\frac{1}{2}\ln 2 - \ln\left[\sqrt{\tilde{\theta}} + \sqrt{1-\tilde{\theta}}\right] - R - \frac{T}{2},$$

which is positive as long as $R$ and $T$ are not too large.

4. $\theta \ne 1/2$ and $\tilde{\theta} = 1/2$: In this case, the numerator is given by

$$s\ln 2 - \ln[\theta^{1-s} + (1-\theta)^{1-s}] - \rho R - sT,$$

and once again, choosing $\rho = 1$ and $s = 1/2$ gives exactly the same expression as in
item 3, except that $\theta$ replaces $\tilde{\theta}$, and hence the conclusion is identical.

We next demonstrate that $\xi^*(R,0) = 1$. Referring to the definition of the Gallager
function $E(\theta,\rho)$ for the BSC:

$$E(\theta,\rho) = \rho\ln 2 - (1+\rho)\ln[\theta^{1/(1+\rho)} + (1-\theta)^{1/(1+\rho)}] - \rho R, \tag{34}$$
R \ T       T = 0.000   T = 0.025   T = 0.050   T = 0.075   T = 0.100   T = 0.125   T = 0.150
R = 0.00      1.000       0.364       0.523       0.418       0.396       0.422       0.298
R = 0.05      1.000       0.756       0.713       0.656       0.535       0.562       0.495
R = 0.10      1.000       0.858       0.774       0.648       0.655       0.585       0.518
R = 0.15      1.000       0.877       0.809       0.720       0.713       0.662       0.622
R = 0.20      1.000       0.905       0.815       0.729       0.729       0.684       0.647
R = 0.25      1.000       0.912       0.832       0.763       0.706       0.661       0.627
R = 0.30      1.000       0.896       0.850       0.788       0.738       0.644       0.613

Table 1: Numerical values of $\xi^*(R,T)$ for various values of $R$ and $T$.

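let us define $\rho' = 1/(1-s) - 1$ and $\rho'' = \rho/s - 1$, and rewrite the numerator of the expression
for $\xi^*(R,0)$ as follows:

$$
\begin{aligned}
&\rho\ln 2 - \ln[\theta^{1-s} + (1-\theta)^{1-s}] - \rho\ln[\tilde{\theta}^{s/\rho} + (1-\tilde{\theta})^{s/\rho}] - \rho R \\
&\quad= \rho\ln 2 - \ln[\theta^{1/(1+\rho')} + (1-\theta)^{1/(1+\rho')}] - \rho\ln[\tilde{\theta}^{1/(1+\rho'')} + (1-\tilde{\theta})^{1/(1+\rho'')}] - \rho R \\
&\quad= \frac{1}{1+\rho'}\left\{\rho'\ln 2 - (1+\rho')\ln[\theta^{1/(1+\rho')} + (1-\theta)^{1/(1+\rho')}] - \rho' R\right\} \\
&\qquad+ \frac{\rho}{1+\rho''}\left\{\rho''\ln 2 - (1+\rho'')\ln[\tilde{\theta}^{1/(1+\rho'')} + (1-\tilde{\theta})^{1/(1+\rho'')}] - \rho'' R\right\} \\
&\quad= (1-s)E(\theta,\rho') + sE(\tilde{\theta},\rho'') \\
&\quad= (1-s)E\left(\theta,\frac{1}{1-s} - 1\right) + sE\left(\tilde{\theta},\frac{\rho}{s} - 1\right).
\end{aligned}
\tag{35}
$$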



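Now, let us choose $s = \rho^*/(1+\rho^*)$, where $\rho^*$ is the achiever of $E^*(\theta) = \max_{0\le\rho\le 1} E(\theta, \rho)$, and $\rho = \rho^*(1+\tilde{\rho})/(1+\rho^*)$, where $\tilde{\rho}$ is the achiever of $E^*(\tilde{\theta}) = \max_{0\le\rho\le 1} E(\tilde{\theta}, \rho)$ (observing that $\rho^*(1+\tilde{\rho})/(1+\rho^*) \le 1$, therefore this choice is feasible). With this choice, $\rho' = \rho^*$ and $\rho'' = \tilde{\rho}$, so the numerator of $\xi^*(R, 0)$ becomes equal to the denominator, and so, $\xi^*(R, 0) = 1$.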
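The decomposition of the numerator into two Gallager functions can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the BSC form of the numerator at $T = 0$ and the Gallager function of eq. (34), finds the maximizing values of $\rho$ by a simple grid search (the helper `argmax_rho` and its grid size are hypothetical choices), and assumes both maximizers are strictly positive, i.e., that $R$ is below capacity for both channels.

```python
import math


def gallager_E(theta, rho, R):
    """Gallager function of eq. (34) for the BSC."""
    e = 1.0 / (1.0 + rho)
    return (rho * math.log(2)
            - (1 + rho) * math.log(theta ** e + (1 - theta) ** e)
            - rho * R)


def numerator_T0(theta, theta_t, s, rho, R):
    """Assumed BSC numerator of xi*(R, 0)."""
    a = math.log(theta ** (1 - s) + (1 - theta) ** (1 - s))
    b = math.log(theta_t ** (s / rho) + (1 - theta_t) ** (s / rho))
    return rho * math.log(2) - a - rho * b - rho * R


def argmax_rho(theta, R, steps=1000):
    """Grid-search maximizer of E(theta, rho) over 0 <= rho <= 1."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda r: gallager_E(theta, r, R))


def check(theta, theta_t, R):
    rho_star = argmax_rho(theta, R)      # achiever for theta
    rho_tilde = argmax_rho(theta_t, R)   # achiever for theta~
    s = rho_star / (1 + rho_star)        # requires rho_star > 0
    rho = rho_star * (1 + rho_tilde) / (1 + rho_star)
    lhs = numerator_T0(theta, theta_t, s, rho, R)
    rhs = ((1 - s) * gallager_E(theta, rho_star, R)
           + s * gallager_E(theta_t, rho_tilde, R))
    return s, rho, lhs, rhs
```

For example, `check(0.1, 0.2, 0.1)` returns a feasible pair with $0 \le s \le \rho \le 1$ and two values `lhs` and `rhs` that coincide up to floating-point error.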

Finally, in Table 1, we provide some numerical results pertaining to the function $\xi^*(R, T)$, where all minimizations and maximizations were carried out by an exhaustive search with a step-size of 0.01 in each dimension. As can be seen, at the left-most column, corresponding to $T = 0$, we indeed obtain $\xi^*(R, 0) = 1$. As can also be seen, $\xi^*(R, T)$ is always strictly less than unity for $T > 0$, and it in general decreases as $T$ grows.

7 Conclusion 

We have addressed the problem of universal decoding with erasures, using the competitive 
minimax methodology proposed in [6], which proved useful. This is in contrast to earlier 
approaches for deriving universal decoders, based on joint typicality considerations, for
which we found no apparent extensions to accommodate Forney's erasure decoder. In order 
to guarantee uniform achievability of a certain fraction of the exponent, the competitive 
minimax approach was applied to the Lagrangian, pertaining to a weighted sum of the two 
error probabilities. 

The analysis of the minimax ratio, $K_n$, resulted in a single-letter lower bound, $\xi^*(R, T)$, to the largest universally achievable fraction of Forney's exponent. We conjecture that $\xi^*(R, T)$ is a tight lower bound that cannot be improved upon. In addition to the reasons we gave earlier for believing in this conjecture, it is also supported by the fact that, at least in extreme cases, like the case $T = 0$ and the case where $\Theta$ is a singleton, it gives the correct value $\xi^*(R, T) = 1$, as expected. An interesting problem for future work would be to prove this conjecture. This requires the derivation of an exponentially tight lower bound to $K_n$, which is a challenge.

The analysis technique offered in this paper opens the door to similar performance
analyses of competitive-minimax universal decoders with various types of random coding 
distributions (cf. the second to the last paragraph of the discussion that follows Theorem 
2). This is in contrast to earlier works (see, e.g., [2], [17]), which were strongly based on the 
assumption that the random coding distribution is uniform within a set. A similar analysis 
technique can be applied also to universal decoding without erasures. 

Finally, we analyzed the example of the BSC in full detail and demonstrated that $\xi^*(R, 0) = 1$. We have also provided some numerical results for this case.

Appendix — Proof of Theorem 2 

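For an event $\mathcal{E} \subseteq \mathcal{Y}^n$, let $1\{y|\mathcal{E}\}$ denote the indicator function of $\mathcal{E}$, i.e., $1\{y|\mathcal{E}\} = 1$ if $y \in \mathcal{E}$ and $1\{y|\mathcal{E}\} = 0$ otherwise. First, observe that

$$1\{y|\mathcal{R}_m\} = 1\left\{y \,\Big|\, f(x_m, y) \ge e^{nT}\sum_{m'\ne m} f(x_{m'}, y)\right\} \le \min_{0\le s\le 1}\left[\frac{f(x_m, y)}{e^{nT}\sum_{m'\ne m} f(x_{m'}, y)}\right]^{1-s}, \tag{A.1}$$

and similarly,

$$1\{y|\mathcal{R}_m^c\} \le \min_{0\le s\le 1}\left[\frac{e^{nT}\sum_{m'\ne m} f(x_{m'}, y)}{f(x_m, y)}\right]^{s}. \tag{A.2}$$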
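The two indicator bounds are instances of the elementary fact that $1\{a \ge b\} \le (a/b)^t$ for positive $a$, $b$ and any $t \in [0, 1]$. A minimal Python sketch of this check follows; the function name `bounds_hold` and the scalar stand-ins `f_m` (for $f(x_m, y)$), `others_sum` (for the sum over $m' \ne m$) and `nT` (for the product $nT$) are hypothetical and only serve the illustration.

```python
import math
import random


def bounds_hold(f_m, others_sum, nT, s, tol=1e-12):
    """Check the indicator bounds (A.1) and (A.2) for one set of values."""
    ratio = f_m / (math.exp(nT) * others_sum)
    in_region = 1.0 if ratio >= 1.0 else 0.0   # indicator of R_m
    out_region = 1.0 - in_region               # indicator of its complement
    a1 = in_region <= ratio ** (1 - s) + tol   # bound (A.1)
    a2 = out_region <= (1 / ratio) ** s + tol  # bound (A.2)
    return a1 and a2


# Random spot check over positive values and exponents s in [0, 1].
rng = random.Random(0)
all_ok = all(
    bounds_hold(rng.uniform(1e-3, 10), rng.uniform(1e-3, 10),
                rng.uniform(0, 2), rng.random())
    for _ in range(10000)
)
```

Since the bounds hold for every $s \in [0, 1]$, they hold in particular for the minimizing $s$, which is what (A.1) and (A.2) state.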
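Then, we have:

$$\begin{aligned}
K_n &\le E\{K_n(\mathcal{R}, \mathcal{C})\} \\
&= E\left\{\max_{\theta\in\Theta_n} \frac{e^{-nT}\Pr\{\mathcal{E}_1\} + \Pr\{\mathcal{E}_2\}}{e^{-n[\xi E_1(R,T,\theta)+T]}}\right\} \\
&= E\Bigg\{\max_{\theta\in\Theta_n} \frac{1}{M}\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n}\Big[e^{-nT}P_\theta(y|X_m)\, e^{n[\xi E_1(R,T,\theta)+T]}\cdot 1\{y|\mathcal{R}_m^c\} \\
&\qquad\qquad + \sum_{m'\ne m} P_\theta(y|X_{m'})\, e^{n[\xi E_1(R,T,\theta)+T]}\cdot 1\{y|\mathcal{R}_m\}\Big]\Bigg\} \\
&\overset{(a)}{\le} E\Bigg\{\frac{1}{M}\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n}\Big[e^{-nT}\max_{\theta\in\Theta_n}\big(P_\theta(y|X_m)\, e^{n[\xi E_1(R,T,\theta)+T]}\big)\cdot 1\{y|\mathcal{R}_m^c\} \\
&\qquad\qquad + \sum_{m'\ne m}\max_{\theta\in\Theta_n}\big[P_\theta(y|X_{m'})\, e^{n[\xi E_1(R,T,\theta)+T]}\big]\cdot 1\{y|\mathcal{R}_m\}\Big]\Bigg\} \\
&= E\Bigg\{\frac{1}{M}\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n}\Big[e^{-nT} f(X_m, y)\cdot 1\{y|\mathcal{R}_m^c\} + \Big(\sum_{m'\ne m} f(X_{m'}, y)\Big)\cdot 1\{y|\mathcal{R}_m\}\Big]\Bigg\} \\
&\overset{(b)}{\le} E\Bigg\{\frac{1}{M}\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n}\Bigg[e^{-nT} f(X_m, y)\cdot\min_{0\le s\le 1}\left[\frac{e^{nT}\sum_{m'\ne m} f(X_{m'}, y)}{f(X_m, y)}\right]^{s} \\
&\qquad\qquad + \Big(\sum_{m'\ne m} f(X_{m'}, y)\Big)\cdot\min_{0\le s\le 1}\left[\frac{f(X_m, y)}{e^{nT}\sum_{m'\ne m} f(X_{m'}, y)}\right]^{1-s}\Bigg]\Bigg\} \\
&= \frac{2}{M}\, E\Bigg\{\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n}\min_{0\le s\le 1} e^{-nT(1-s)} f^{1-s}(X_m, y)\Big[\sum_{m'\ne m} f(X_{m'}, y)\Big]^{s}\Bigg\} \\
&= E\Bigg\{\frac{2}{M}\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n}\min_{0\le s\le 1} e^{-nT(1-s)}\Big[\max_{\theta\in\Theta_n} P_\theta(y|X_m)\, e^{n[\xi E_1(R,T,\theta)+T]}\Big]^{1-s} \\
&\qquad\qquad \times\Big[\sum_{m'\ne m}\max_{\tilde{\theta}\in\Theta_n} P_{\tilde{\theta}}(y|X_{m'})\, e^{n[\xi E_1(R,T,\tilde{\theta})+T]}\Big]^{s}\Bigg\} \\
&\overset{(c)}{\le} E\Bigg\{\frac{2}{M}\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n}\min_{0\le s\le 1} e^{-nT(1-s)}\Big[\max_{\theta\in\Theta_n} P_\theta(y|X_m)\, e^{n[\xi E_1(R,T,\theta)+T]}\Big]^{1-s} \\
&\qquad\qquad \times\Big[\sum_{m'\ne m}\sum_{\tilde{\theta}\in\Theta_n} P_{\tilde{\theta}}(y|X_{m'})\, e^{n[\xi E_1(R,T,\tilde{\theta})+T]}\Big]^{s}\Bigg\} \\
&= E\Bigg\{\frac{2}{M}\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n}\min_{0\le s\le 1} e^{-nT(1-s)}\Big[\max_{\theta\in\Theta_n} P_\theta(y|X_m)\, e^{n[\xi E_1(R,T,\theta)+T]}\Big]^{1-s} \\
&\qquad\qquad \times\Big[\sum_{\tilde{\theta}\in\Theta_n}\sum_{m'\ne m} P_{\tilde{\theta}}(y|X_{m'})\, e^{n[\xi E_1(R,T,\tilde{\theta})+T]}\Big]^{s}\Bigg\} \\
&\overset{(d)}{\le} \frac{2}{M}\, E\Bigg\{\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n}\min_{0\le s\le 1} e^{-nT(1-s)}\Big[\max_{\theta\in\Theta_n} P_\theta(y|X_m)\, e^{n[\xi E_1(R,T,\theta)+T]}\Big]^{1-s} \\
&\qquad\qquad \times\Big[|\Theta_n|\cdot\max_{\tilde{\theta}\in\Theta_n}\sum_{m'\ne m} P_{\tilde{\theta}}(y|X_{m'})\, e^{n[\xi E_1(R,T,\tilde{\theta})+T]}\Big]^{s}\Bigg\} \\
&\le E\Bigg\{\frac{2|\Theta_n|}{M}\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n}\min_{0\le s\le 1} e^{-nT(1-s)}\Big[\max_{\theta\in\Theta_n} P_\theta(y|X_m)\, e^{n[\xi E_1(R,T,\theta)+T]}\Big]^{1-s} \\
&\qquad\qquad \times\Big[\max_{\tilde{\theta}\in\Theta_n}\sum_{m'\ne m} P_{\tilde{\theta}}(y|X_{m'})\, e^{n[\xi E_1(R,T,\tilde{\theta})+T]}\Big]^{s}\Bigg\} \\
&\overset{(e)}{\le} \frac{2|\Theta_n|}{M}\, E\Bigg\{\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n}\sum_{\theta\in\Theta_n}\sum_{\tilde{\theta}\in\Theta_n}\min_{0\le s\le 1} e^{-nT(1-s)}\Big[P_\theta(y|X_m)\, e^{n[\xi E_1(R,T,\theta)+T]}\Big]^{1-s} \\
&\qquad\qquad \times\Big[\sum_{m'\ne m} P_{\tilde{\theta}}(y|X_{m'})\, e^{n[\xi E_1(R,T,\tilde{\theta})+T]}\Big]^{s}\Bigg\} \\
&= \frac{2|\Theta_n|}{M}\sum_{\theta\in\Theta_n}\sum_{\tilde{\theta}\in\Theta_n}\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n} E\Bigg\{\min_{0\le s\le 1} e^{-nT(1-s)}\Big[P_\theta(y|X_m)\, e^{n[\xi E_1(R,T,\theta)+T]}\Big]^{1-s} \\
&\qquad\qquad \times\Big[\sum_{m'\ne m} P_{\tilde{\theta}}(y|X_{m'})\, e^{n[\xi E_1(R,T,\tilde{\theta})+T]}\Big]^{s}\Bigg\} \\
&\overset{(f)}{\le} \frac{2|\Theta_n|^3}{M}\max_{\theta\in\Theta_n}\max_{\tilde{\theta}\in\Theta_n}\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n} E\Bigg\{\min_{0\le s\le 1} e^{-nT(1-s)}\Big[P_\theta(y|X_m)\, e^{n[\xi E_1(R,T,\theta)+T]}\Big]^{1-s} \\
&\qquad\qquad \times\Big[\sum_{m'\ne m} P_{\tilde{\theta}}(y|X_{m'})\, e^{n[\xi E_1(R,T,\tilde{\theta})+T]}\Big]^{s}\Bigg\} \\
&\le \frac{2|\Theta_n|^3}{M}\max_{\theta\in\Theta_n}\max_{\tilde{\theta}\in\Theta_n}\min_{0\le s\le 1}\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n} E\Bigg\{e^{-nT(1-s)}\Big[P_\theta(y|X_m)\, e^{n[\xi E_1(R,T,\theta)+T]}\Big]^{1-s} \\
&\qquad\qquad \times\Big[\sum_{m'\ne m} P_{\tilde{\theta}}(y|X_{m'})\, e^{n[\xi E_1(R,T,\tilde{\theta})+T]}\Big]^{s}\Bigg\} \\
&= \frac{2|\Theta_n|^3}{M}\max_{\theta\in\Theta_n}\max_{\tilde{\theta}\in\Theta_n}\min_{0\le s\le 1} e^{n[\xi\{(1-s)E_1(R,T,\theta)+sE_1(R,T,\tilde{\theta})\}+sT]} \\
&\qquad\qquad \times\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n} E\Bigg\{P_\theta^{1-s}(y|X_m)\cdot\Big[\sum_{m'\ne m} P_{\tilde{\theta}}(y|X_{m'})\Big]^{s}\Bigg\},
\end{aligned} \tag{A.3}$$

where (a) follows from the fact that the maximum (over $\theta$) of a summation is upper bounded by the summation of the maxima, (b) follows from (A.1) and (A.2), and (c), (d), (e) and (f) all follow from the fact that if $g(\theta)$ is non-negative then

$$\max_{\theta\in\Theta_n} g(\theta) \le \sum_{\theta\in\Theta_n} g(\theta) \le |\Theta_n|\cdot\max_{\theta\in\Theta_n} g(\theta).$$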
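Step (b) and the sandwich inequality used in steps (c) to (f) are both elementary and can be spot-checked numerically. In the Python sketch below, `f_m` stands for $f(X_m, y)$, `others_sum` for $\sum_{m'\ne m} f(X_{m'}, y)$ and `nT` for the product $nT$; these names, like the function names, are hypothetical and chosen only for the illustration.

```python
import math


def lagrangian_term(f_m, others_sum, nT):
    """e^{-nT} f 1{R_m^c} + S 1{R_m}, with R_m = {f >= e^{nT} S}."""
    in_region = f_m >= math.exp(nT) * others_sum
    return others_sum if in_region else math.exp(-nT) * f_m


def step_b_bound(f_m, others_sum, nT, s):
    """2 e^{-nT(1-s)} f^{1-s} S^s, the bound obtained from (A.1) and (A.2)."""
    return 2.0 * math.exp(-nT * (1 - s)) * f_m ** (1 - s) * others_sum ** s


def tightest_bound(f_m, others_sum, nT, steps=100):
    """Minimum of the bound over a grid of s in [0, 1]."""
    return min(step_b_bound(f_m, others_sum, nT, i / steps)
               for i in range(steps + 1))


def sandwich(values):
    """Return (max, sum, count * max) for a list of non-negative numbers."""
    return max(values), sum(values), len(values) * max(values)


lo, mid, hi = sandwich([0.2, 0.05, 0.7, 0.0])
```

The bound of step (b) holds for each $s$ separately, so it also holds after minimizing over $s$; the factor of 2 arises because both terms of the sum are bounded by the same expression.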
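Assuming that the codewords are drawn independently, we then have:

$$\begin{aligned}
K_n &\le \frac{2|\Theta_n|^3}{M}\max_{\theta\in\Theta_n}\max_{\tilde{\theta}\in\Theta_n}\min_{0\le s\le 1} e^{n[\xi\{(1-s)E_1(R,T,\theta)+sE_1(R,T,\tilde{\theta})\}+sT]} \\
&\qquad \times\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n} E\{P_\theta^{1-s}(y|X_m)\}\cdot E\Bigg\{\Big[\sum_{m'\ne m} P_{\tilde{\theta}}(y|X_{m'})\Big]^{s}\Bigg\} \\
&\le \frac{2|\Theta_n|^3}{M}\max_{\theta\in\Theta_n}\max_{\tilde{\theta}\in\Theta_n}\min_{0\le s\le 1} e^{n[\xi\{(1-s)E_1(R,T,\theta)+sE_1(R,T,\tilde{\theta})\}+sT]} \\
&\qquad \times\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n} E\{P_\theta^{1-s}(y|X_m)\}\cdot E\Bigg\{\min_{s\le\rho\le 1}\Big[\sum_{m'\ne m} P_{\tilde{\theta}}^{s/\rho}(y|X_{m'})\Big]^{\rho}\Bigg\} \\
&\le \frac{2|\Theta_n|^3}{M}\max_{\theta\in\Theta_n}\max_{\tilde{\theta}\in\Theta_n}\min_{0\le s\le 1} e^{n[\xi\{(1-s)E_1(R,T,\theta)+sE_1(R,T,\tilde{\theta})\}+sT]} \\
&\qquad \times\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n} E\{P_\theta^{1-s}(y|X_m)\}\cdot\min_{s\le\rho\le 1} E\Bigg\{\Big[\sum_{m'\ne m} P_{\tilde{\theta}}^{s/\rho}(y|X_{m'})\Big]^{\rho}\Bigg\} \\
&\le \frac{2|\Theta_n|^3}{M}\max_{\theta\in\Theta_n}\max_{\tilde{\theta}\in\Theta_n}\min_{0\le s\le\rho\le 1} e^{n[\xi\{(1-s)E_1(R,T,\theta)+sE_1(R,T,\tilde{\theta})\}+sT]} \\
&\qquad \times\sum_{m=1}^{M}\sum_{y\in\mathcal{Y}^n} E\{P_\theta^{1-s}(y|X_m)\}\cdot\Big[\sum_{m'\ne m} E\{P_{\tilde{\theta}}^{s/\rho}(y|X_{m'})\}\Big]^{\rho},
\end{aligned} \tag{A.4}$$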



where in the last step we have used Jensen's inequality. Now, observe that the summands do not depend on $m$, therefore, the effects of the summation over $m$ and the factor of $1/M$ cancel each other. Also, the sum of $M - 1$ contributions of identical expectations $E\{P_{\tilde{\theta}}^{s/\rho}(y|X_{m'})\}$ creates a factor of $M - 1$ (upper bounded by $M$) raised to the power of $\rho$. Denoting

$$U(y, \lambda, \theta) = E\{P_\theta^{\lambda}(y|X)\},$$

we have:

$$K_n \le 2|\Theta_n|^3\max_{\theta\in\Theta_n}\max_{\tilde{\theta}\in\Theta_n}\min_{0\le s\le\rho\le 1} M^{\rho}\cdot e^{n[\xi\{(1-s)E_1(R,T,\theta)+sE_1(R,T,\tilde{\theta})\}+sT]}\sum_{y\in\mathcal{Y}^n} U(y, 1-s, \theta)\cdot U^{\rho}(y, s/\rho, \tilde{\theta}). \tag{A.5}$$
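The passage from a sum raised to the power $s$ to a sum of terms raised to the power $s/\rho$ appears to rely on the standard inequality $(\sum_i a_i)^s \le (\sum_i a_i^{s/\rho})^{\rho}$ for non-negative $a_i$ and $0 < s \le \rho$, followed by Jensen's inequality for the concave map $z \mapsto z^{\rho}$ with $\rho \le 1$. The following Python sketch checks both facts on arbitrary numbers; it is an illustration of these two inequalities only, and the function names are hypothetical.

```python
def sum_power(values, s):
    """(sum_i a_i)^s."""
    return sum(values) ** s


def relaxed_power(values, s, rho):
    """(sum_i a_i^{s/rho})^rho, an upper bound on sum_power when s <= rho."""
    return sum(v ** (s / rho) for v in values) ** rho


def jensen_gap(samples, rho):
    """(mean z)^rho - mean(z^rho), non-negative for 0 <= rho <= 1."""
    n = len(samples)
    mean_of_power = sum(z ** rho for z in samples) / n
    power_of_mean = (sum(samples) / n) ** rho
    return power_of_mean - mean_of_power


values = [0.3, 0.02, 0.5, 0.001, 2.0]
grid = [i / 10 for i in range(1, 11)]
power_ok = all(sum_power(values, s) <= relaxed_power(values, s, rho) + 1e-12
               for s in grid for rho in grid if s <= rho)
jensen_ok = all(jensen_gap(values, rho) >= -1e-12 for rho in grid)
```

Taking $\rho = s$ recovers the original sum, so the extra minimization over $\rho \in [s, 1]$ can only tighten the bound.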



To compute $U(y, \lambda, \theta)$, we use the method of types [2]. Now, $Q$ is assumed i.i.d. and uniform over the entire input space, i.e., $Q(x) = 1/|\mathcal{X}|^n$ for all $x$. Let $P_{xy}$ denote the empirical joint distribution of $(x, y)$ and let $E_{xy}\{\cdot\}$ denote the corresponding empirical expectation, i.e., the expectation w.r.t. $P_{xy}$. Also, let $T(x|y)$ denote the conditional type class of $x$ given $y$, i.e., the set of $x'$ with $P_{x'y} = P_{xy}$, and let $H_{xy}(X|Y)$ denote the corresponding empirical conditional entropy of $X$ given $Y$. Then,

$$\begin{aligned}
U(y, \lambda, \theta) &= \frac{1}{|\mathcal{X}|^n}\sum_{x\in\mathcal{X}^n} P_\theta^{\lambda}(y|x) \\
&= \frac{1}{|\mathcal{X}|^n}\sum_{T(x|y)\subseteq\mathcal{X}^n} |T(x|y)|\cdot e^{\lambda n E_{xy}\ln P_\theta(Y|X)} \\
&\le \frac{1}{|\mathcal{X}|^n}\sum_{T(x|y)\subseteq\mathcal{X}^n} e^{nH_{xy}(X|Y)}\cdot e^{\lambda n E_{xy}\ln P_\theta(Y|X)} \\
&\le (n+1)^{|\mathcal{Y}|\cdot(|\mathcal{X}|-1)}\cdot e^{-nF(P_y, \lambda, \theta)},
\end{aligned} \tag{A.6}$$

where $F(P_y, \lambda, \theta)$ is defined as in eq. (22). On substituting this bound into the upper bound on $K_n(\mathcal{R})$, we get:

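$$\begin{aligned}
K_n &\le 2|\Theta_n|^3 (n+1)^{2|\mathcal{Y}|\cdot(|\mathcal{X}|-1)}\max_{\theta\in\Theta_n}\max_{\tilde{\theta}\in\Theta_n}\min_{0\le s\le\rho\le 1} M^{\rho}\cdot e^{n[\xi\{(1-s)E_1(R,T,\theta)+sE_1(R,T,\tilde{\theta})\}+sT]} \\
&\qquad \times\sum_{y\in\mathcal{Y}^n} e^{-n[F(P_y, 1-s, \theta)+\rho F(P_y, s/\rho, \tilde{\theta})]} \\
&\le 2|\Theta_n|^3 (n+1)^{2|\mathcal{Y}|\cdot(|\mathcal{X}|-1)}\max_{\theta\in\Theta_n}\max_{\tilde{\theta}\in\Theta_n}\min_{0\le s\le\rho\le 1} M^{\rho}\cdot e^{n[\xi\{(1-s)E_1(R,T,\theta)+sE_1(R,T,\tilde{\theta})\}+sT]} \\
&\qquad \times\sum_{T(y)\subseteq\mathcal{Y}^n} e^{nH_y(Y)}\cdot e^{-n[F(P_y, 1-s, \theta)+\rho F(P_y, s/\rho, \tilde{\theta})]} \\
&\le 2|\Theta_n|^3 (n+1)^{3|\mathcal{Y}|\cdot(|\mathcal{X}|-1)}\max_{\theta\in\Theta_n}\max_{\tilde{\theta}\in\Theta_n}\min_{0\le s\le\rho\le 1} M^{\rho}\cdot e^{n[\xi\{(1-s)E_1(R,T,\theta)+sE_1(R,T,\tilde{\theta})\}+sT]} \\
&\qquad \times e^{-n\min_{P_y}[F(P_y, 1-s, \theta)+\rho F(P_y, s/\rho, \tilde{\theta})-H_y(Y)]} \\
&\le 2|\Theta_n|^3 (n+1)^{3|\mathcal{Y}|\cdot(|\mathcal{X}|-1)}\max_{\theta\in\Theta_n}\max_{\tilde{\theta}\in\Theta_n}\min_{0\le s\le\rho\le 1} M^{\rho}\cdot e^{n[\xi\{(1-s)E_1(R,T,\theta)+sE_1(R,T,\tilde{\theta})\}+sT]}\cdot e^{-nE(\theta, \tilde{\theta}, s, \rho)}.
\end{aligned} \tag{A.7}$$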
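For a concrete feel of the quantity $U(y, \lambda, \theta) = E\{P_\theta^{\lambda}(y|X)\}$, consider the BSC of the earlier example with uniformly distributed binary inputs. In that special case the expectation can be computed exactly: it equals $\{[\theta^{\lambda} + (1-\theta)^{\lambda}]/2\}^n$ regardless of $y$. The Python sketch below compares this closed form with a brute-force average over all inputs; the function names are hypothetical and the example is restricted to this special case.

```python
import itertools
import math


def U_bruteforce(y, lam, theta):
    """Average of P_theta(y|x)^lam over all binary x of the same length."""
    n = len(y)
    total = 0.0
    for x in itertools.product((0, 1), repeat=n):
        d = sum(xi != yi for xi, yi in zip(x, y))  # Hamming distance
        total += (theta ** d * (1 - theta) ** (n - d)) ** lam
    return total / 2 ** n


def U_closed_form(n, lam, theta):
    """Closed form for the BSC with uniform inputs."""
    return ((theta ** lam + (1 - theta) ** lam) / 2) ** n


y = (0, 1, 1, 0, 1)
brute = U_bruteforce(y, 0.7, 0.1)
closed = U_closed_form(len(y), 0.7, 0.1)
```

The per-symbol exponent in this case is $\ln 2 - \ln[\theta^{\lambda} + (1-\theta)^{\lambda}]$, which is the source of the logarithmic terms in the BSC expressions.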

We would like to find the maximum value of $\xi$ such that $K_n$ would be guaranteed not to grow exponentially. To this end, we can now ignore the factor $2|\Theta_n|^3 (n+1)^{3|\mathcal{Y}|\cdot(|\mathcal{X}|-1)}$, which is polynomial in $n$ (cf. eq. (19)). Thus, the latter upper bound will be sub-exponential in $n$ as long as

$$\min_{\theta, \tilde{\theta}}\max_{0\le s\le\rho\le 1}\left[E(\theta, \tilde{\theta}, s, \rho) - \xi\{(1-s)E_1(R,T,\theta) + sE_1(R,T,\tilde{\theta})\} - \rho R - sT\right] \ge 0, \tag{A.8}$$

or, equivalently, for every $(\theta, \tilde{\theta})$, there exist $(\rho, s)$, $0 \le s \le \rho \le 1$, such that

$$E(\theta, \tilde{\theta}, s, \rho) \ge \xi\{(1-s)E_1(R,T,\theta) + sE_1(R,T,\tilde{\theta})\} + \rho R + sT, \tag{A.9}$$

i.e.,

$$\xi \le \frac{E(\theta, \tilde{\theta}, s, \rho) - \rho R - sT}{(1-s)E_1(R,T,\theta) + sE_1(R,T,\tilde{\theta})}.$$

In other words, for every $\xi \le \xi^*(R, T)$, where $\xi^*(R, T)$ is defined as in eq. (24), $K_n(\mathcal{R})$ is guaranteed not to grow exponentially with $n$. This completes the proof of Theorem 2.
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